Leontev Aleksandr, 4167

Heart Disease UCI Dataset

INSTALLATION, SETUP AND REQUIREMENTS

This project contains of two files:

  • project_heart_disease.ipynb - a notebook document used by Jupyter Notebook
  • input/hear.csv - dataset

So to use this project you should install Jupyter Notebook with Python environment contaning appropriate libraries.

Prerequisite: Python

While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook. In my project I used Python 3.6. I recommend using the Anaconda distribution to install Python and Jupyter. We’ll go through its installation in the next section.

Installing Jupyter using Anaconda and conda

For new users, it is highly recommended installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

  • Download Anaconda using following link https://www.anaconda.com/download. I recommend downloading Anaconda’s Python 3.6 version (currently Python 3.7).
  • Istall the version of Anaconda which you downloaded, following the instructions on the download page.
  • Congratulations, you have installed Jupyter Notebook.

Used libraries | Anaconda Environment Installation

You should install following used in project libraries using Anaconda - Environments

  • scikit-learn - A set of python modules for machine learning and data mining
  • pandas - High-performance, easy-to-use data structures and data analysis tools
  • numpy - Array processing for numbers, strings, records and objects
  • matplotlib - Publication quality Figures in python
  • seaborn - Statistical data visualization

Here you can see an example for scikit library.

Starting the Notebook Server

After you have installed the Jupyter Notebook on your computer, you are ready to run the notebook server. You can start the notebook server from the command line (using Terminal on Mac/Linux, Command Prompt on Windows) by running:

jupyter notebook

This will print some information about the notebook server in your terminal, including the URL of the web application (by default, http://localhost:8888):

jupyter notebook

[I 08:58:24.417 NotebookApp] Serving notebooks from local directory: /Users/alexander.leontev

[I 08:58:24.417 NotebookApp] 0 active kernels

[I 08:58:24.417 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/

[I 08:58:24.417 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

It will then open your default web browser to this URL. When the notebook opens in your browser, you will see the Notebook Dashboard, which will show a list of the notebooks, files, and subdirectories in the directory where the notebook server was started. Most of the time, you will wish to start a notebook server in the highest level directory containing notebooks. Often this will be your home directory.

So now use can use this project opening file project_heart_disease.ipynb and run codes exploring something interesting.

INTRODUCTION

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

DATASET COLUMNS FEATURE EXPLAIN

  • Age (age in years)
  • Sex (1 = male; 0 = female)
  • CP (chest pain type)
  • TRESTBPS (resting blood pressure (in mm Hg on admission to the hospital))
  • CHOL (serum cholestoral in mg/dl)
  • FPS (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • RESTECH (resting electrocardiographic results)
  • THALACH (maximum heart rate achieved)
  • EXANG (exercise induced angina (1 = yes; 0 = no))
  • OLDPEAK (ST depression induced by exercise relative to rest)
  • SLOPE (the slope of the peak exercise ST segment)
  • CA (number of major vessels (0-3) colored by flourosopy)
  • THAL (3 = normal; 6 = fixed defect; 7 = reversable defect)
  • TARGET (1 or 0)

INVESTIGATING THE DATA

Firstly, we should import all the libraries that we will use in our application. All necessary Python modules imports are shown below:

In [949]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.model_selection import GridSearchCV,train_test_split,cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
import os
import warnings
In [950]:
warnings.filterwarnings('ignore')

Let's uplaod our data set to the data variable using the read_csv function in the pandas library.

In [951]:
print(os.listdir("./input"))
['heart.csv']
In [952]:
data = pd.read_csv('./input/heart.csv')

data = data.sample(frac=1)
In [953]:
# Now, our data is loaded. We're writing the following snippet to see the loaded data. The purpose here is to see the top five of the loaded data.

print('Data First 5 Rows Show\n')
data.head()
Data First 5 Rows Show

Out[953]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
207 60 0 0 150 258 0 0 157 0 2.6 1 2 3 0
296 63 0 0 124 197 0 1 136 1 0.0 1 0 2 0
116 41 1 2 130 214 0 0 168 0 2.0 1 0 2 1
62 52 1 3 118 186 0 0 190 0 0.0 1 0 1 1
255 45 1 0 142 309 0 0 147 1 0.0 1 3 3 0
In [954]:
print('Data Last 5 Rows Show\n')
data.tail()
Data Last 5 Rows Show

Out[954]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
290 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
135 49 0 0 130 269 0 1 163 0 0.0 2 0 2 1
88 54 0 2 110 214 0 1 158 0 1.6 1 0 2 1
53 44 0 2 108 141 0 1 175 0 0.6 1 0 2 1
22 42 1 0 140 226 0 1 178 0 0.0 2 0 2 1

Both the head() and tail() functions have a value of 5 by default. Different values should be given as parameters to change these values.

In [955]:
# How many rows and columns are there for all data?
print('Data Shape Show\n')
data.shape  #first one is rows, other is columns
Data Shape Show

Out[955]:
(303, 14)
In [956]:
print('Data Show Info\n')
data.info()
Data Show Info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303 entries, 207 to 22
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 35.5 KB
In [957]:
# Now, we will check null on all data and If data has null, I will sum of null data's. In this way, how many missing data is in the data.
print('Data Sum of Null Values \n')
data.isnull().sum()
Data Sum of Null Values 

Out[957]:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
In [958]:
# All rows control for null values
data.isnull().values.any()
Out[958]:
False

So, there is no missing data in the database

In [959]:
print('Data Show Describe\n')
data.describe()
Data Show Describe

Out[959]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 0.683168 0.966997 131.623762 246.264026 0.148515 0.528053 149.646865 0.326733 1.039604 1.399340 0.729373 2.313531 0.544554
std 9.082101 0.466011 1.032052 17.538143 51.830751 0.356198 0.525860 22.905161 0.469794 1.161075 0.616226 1.022606 0.612277 0.498835
min 29.000000 0.000000 0.000000 94.000000 126.000000 0.000000 0.000000 71.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 47.500000 0.000000 0.000000 120.000000 211.000000 0.000000 0.000000 133.500000 0.000000 0.000000 1.000000 0.000000 2.000000 0.000000
50% 55.000000 1.000000 1.000000 130.000000 240.000000 0.000000 1.000000 153.000000 0.000000 0.800000 1.000000 0.000000 2.000000 1.000000
75% 61.000000 1.000000 2.000000 140.000000 274.500000 0.000000 1.000000 166.000000 1.000000 1.600000 2.000000 1.000000 3.000000 1.000000
max 77.000000 1.000000 3.000000 200.000000 564.000000 1.000000 2.000000 202.000000 1.000000 6.200000 2.000000 4.000000 3.000000 1.000000

The features described in the above data set are:

  1. Count tells us the number of NoN-empty rows in a feature.

  2. Mean tells us the mean value of that feature.

  3. Std tells us the Standard Deviation Value of that feature.

  4. Min tells us the minimum value of that feature.

  5. 25%, 50%, and 75% are the percentile/quartile of each features.

  6. Max tells us the maximum value of that feature.

Observing the target as it is the output column which shows us if the patients had the heart dieases or not. Here we are observing target values and we can observe that we have more people with cases having heart dieases then people not having heart dieases in our data.

In [960]:
fig, ax = plt.subplots(figsize=(5, 8))
sns.countplot(data['target'])
plt.title('Target values')
Out[960]:
Text(0.5, 1.0, 'Target values')

Now lets see how many male and female in our data

In [961]:
male =len(data[data['sex'] == 1])
female = len(data[data['sex']== 0])

plt.figure(figsize=(8,6))

# Data to plot
labels = 'Male', 'Female'
sizes = [male, female]
colors = ['skyblue', 'yellowgreen']
explode = (0, 0)  # explode 1st slice
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=90)
 
plt.axis('equal')
plt.show()

Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

In [962]:
plt.figure(figsize=(8,6))

# Data to plot
labels = 'fasting blood sugar < 120 mg/dl','fasting blood sugar > 120 mg/dl'
sizes = [len(data[data['fbs'] == 0]),len(data[data['cp'] == 1])]
colors = ['skyblue', 'yellowgreen','orange','gold']
explode = (0.1, 0)  # explode 1st slice
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=180)
 
plt.axis('equal')
plt.show()

Now below I have shown distribution of age over database.

In [963]:
sns.distplot(data['age'])
Out[963]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fee44a710>

Here we are observing the bar plot with the belive that different type of chest pain causes different type of targets in male and female. We can see that cp type 4 is more responsible in female while chest pain type 2 is more responsible in male.

In [964]:
sns.set(font_scale=2)
fig, ax = plt.subplots(figsize=(20, 10))
plt.title('Target values with different cp in male and female')
sns.barplot(x='sex', y='target', hue='cp', data=data)
Out[964]:
<matplotlib.axes._subplots.AxesSubplot at 0x20feeae4e10>

Below i have shown a plot by samoling the age feature in some parts to observe the effect of excersize on the maximum heart rate achieved in different agegroup. We can see that here the maximum heart beat decreases as the age increases but it further decreases as the patient has excersize induced angina. by this we can say that by achieving lower maximum heart beat we can predict and save ourselves from excersize induced angina.

In [965]:
ages = ['age']
bins = [29, 35, 45, 55, 65, 77]
labels = [ '25-40','41-50', '51-60', '61-70', '70+']
data['agerange'] = pd.cut(data.age, bins, labels = labels, include_lowest = True)
fig, ax = plt.subplots(figsize=(20, 10))
plt.title('Maximum heart rate as the agerange with and without excersize induced angina ')
sns.barplot(x = 'agerange', y = 'thalach', hue = 'exang', data = data)
data = data.drop(["agerange"] , axis=1)

Plotting the distribution of various features

Thalach: maximum heart rate achieved

In [966]:
sns.distplot(data['thalach'], kde = False, bins = 30, color = 'violet')
Out[966]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fee31f710>

Chol: serum cholestoral in mg/dl

In [967]:
sns.distplot(data['chol'], kde = False, bins = 30, color = 'red')
plt.show()

Trestbps: resting blood pressure (in mm Hg on admission to the hospital)

In [968]:
sns.distplot(data['trestbps'], kde = False, bins = 30, color = 'blue')
plt.show()

Number of people who have heart disease according to age

In [969]:
plt.figure(figsize = (30, 12))
sns.countplot(x = 'age', data = data, hue = 'target', palette = 'GnBu')
plt.show()

Heatmap correlation

Below I have shown a correlation matrix to show if there is any corelelation between different features. And we can say that there is not any serious corelation between features.

Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient: Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression.

Pearson's correlation

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are to this line of best fit (i.e., how well the data points fit this new model/line of best fit).

Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The point-biserial correlation is conducted with the Pearson correlation formula except that one of the variables is dichotomous. The following formula is used to calculate the Pearson r correlation:

The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases. This is shown in the diagram below:

  • rxy = Pearson r correlation coefficient between x and y
  • n = number of observations
  • xi = value of x (for ith observation)
  • yi = value of y (for ith observation)

The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there are no data points that show any variation away from this line. Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are shown in the diagram below:

In [970]:
plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="pearson"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()

Spearman's correlation

Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

The following formula is used to calculate the Spearman rank correlation:

  • ρ = Spearman rank correlation
  • di= the difference between the ranks of corresponding variables
  • n = number of observations
In [971]:
plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="spearman"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()

Kendall's correlation

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.

It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897.

Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.

Both Kendall's and Spearman's can be formulated as special cases of a more general correlation coefficient.

Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is n, we know that the total number of pairings with a b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:

  • Nc = number of concordant
  • Nd = Number of discordant
In [972]:
plt.figure(figsize=(40, 20))
sns.heatmap(data.corr(method="kendall"), annot = True, cmap='coolwarm', linewidths=.1)
plt.show()

As result we can see that there is not any serious corelation between features.

CLASSIFICATION: MODEL, TRAINING and TESTING

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied. In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis, i.e., a type of unsupervised learning, rather than the supervised learning described in this article.

We will implement, research and compare 5 algorithms of classification:

  • Logistic Regression
  • Decision Trees
  • Random Forest
  • K-Nearest Neighbors
  • Naive Bayes
  • C-Support Vector Classification

Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. Logistic regression is implemented in LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional , or Elastic-Net regularization.

Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.

Some advantages of decision trees are:

  • Simple to understand and to interpret. Trees can be visualised.
  • Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  • Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
  • Able to handle multi-output problems.
  • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
The disadvantages of decision trees include:
  • Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Random Forest

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features. (See the parameter tuning guidelines for more details). The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yield decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant hence yielding an overall better model. In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

K-Nearest Neighbors

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k-nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct. In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radiusr , such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”. The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.

Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

C-Support Vector Classification

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.linear_model.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer. The multiclass support is handled according to a one-vs-one scheme. For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
  • If the number of features is much greater than the number of samples, avoid over-fitting in choosing Kernel functions and regularization term is crucial.
  • SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.

Implementing and testing on the dataset

Firstly, we should split the dataset into training and test set

In [973]:
y = data["target"].values
x = data.drop(["target"] , axis = 1)

Preprocessing - Scaling the features

In [974]:
# Normalization
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)
In [975]:
# Train test split
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 , random_state = 0)

Trying different algorithms using Scikit-learn (formerly scikits.learn) librart. It is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

In [976]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
Out[976]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

We can see the resulted linear equation:

In [977]:
print("Logistic Regression linear equation:")

y_string = "y = "0
for i in range(13):
    y_string += str(round(lr.coef_[0][i], 4))+"*x"+str(i)+" + "

y_string += str(round(lr.intercept_[0], 4));
print(y_string)
  File "<ipython-input-977-9b6ab1a35ef4>", line 3
    y_string = "y = "0
                     ^
SyntaxError: invalid syntax
In [978]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

plt.figure(figsize=(35, 20))
tree.plot_tree(dt)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
rf.fit(x_train, y_train)

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(x_train, y_train)

# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train, y_train)

# C-Support Vector Classification
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(x_train, y_train)
Out[978]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)
In [979]:
print("Logistic Regression Score : {}".format(lr.score(x_test, y_test)))
print("Decision Tree Score ..... : {}".format(dt.score(x_test, y_test)))
print("Random Forest Score ..... : {}".format(rf.score(x_test, y_test)))
print("Knn Score ............... : {}".format(knn.score(x_test, y_test)))
print("Naive Bayes Score ....... : {}".format(nb.score(x_test, y_test)))
print("SVM Score ............... : {}".format(svm.score(x_test, y_test)))
Logistic Regression Score : 0.819672131147541
Decision Tree Score ..... : 0.7704918032786885
Random Forest Score ..... : 0.8688524590163934
Knn Score ............... : 0.8524590163934426
Naive Bayes Score ....... : 0.9180327868852459
SVM Score ............... : 0.8688524590163934

Now lets try different test size as it will have a great impact on accuracy of algorithms.

In [980]:
# Define test sizes

test_size = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6]
lr_acc = [None] * len(test_size)
dt_acc = [None] * len(test_size)
rf_acc = [None] * len(test_size)
knn_acc = [None] * len(test_size)
nb_acc = [None] * len(test_size)
svm_acc = [None] * len(test_size)

Logistic regression with different test sizes

In [981]:
print("Logistic Regression\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    clf.predict(x_test)
    accuracy = clf.score(x_test,y_test)
    
    lr_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
    
    
Logistic Regression

Accuracy for test size 0.05 is 0.9375
Accuracy for test size 0.1 is 0.8064516129032258
Accuracy for test size 0.15 is 0.8043478260869565
Accuracy for test size 0.2 is 0.819672131147541
Accuracy for test size 0.25 is 0.8421052631578947
Accuracy for test size 0.3 is 0.8461538461538461
Accuracy for test size 0.35 is 0.8411214953271028
Accuracy for test size 0.4 is 0.8524590163934426
Accuracy for test size 0.45 is 0.8613138686131386
Accuracy for test size 0.5 is 0.8289473684210527
Accuracy for test size 0.55 is 0.8203592814371258
Accuracy for test size 0.6 is 0.8241758241758241

------------------------------------------------

Decision Tree with different test sizes

In [982]:
print("Decision Tree\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    dt = DecisionTreeClassifier()
    dt.fit(x_train, y_train)
    dt.predict(x_test)
    accuracy = dt.score(x_test,y_test)
    
    dt_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
Decision Tree

Accuracy for test size 0.05 is 0.9375
Accuracy for test size 0.1 is 0.8064516129032258
Accuracy for test size 0.15 is 0.7391304347826086
Accuracy for test size 0.2 is 0.7213114754098361
Accuracy for test size 0.25 is 0.75
Accuracy for test size 0.3 is 0.6703296703296703
Accuracy for test size 0.35 is 0.719626168224299
Accuracy for test size 0.4 is 0.7704918032786885
Accuracy for test size 0.45 is 0.7445255474452555
Accuracy for test size 0.5 is 0.6907894736842105
Accuracy for test size 0.55 is 0.7305389221556886
Accuracy for test size 0.6 is 0.7362637362637363

------------------------------------------------

Random Forest with different test sizes

In [983]:
print("Random Forest\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
    rf.fit(x_train, y_train)
    rf.predict(x_test)
    accuracy = rf.score(x_test,y_test)
    
    rf_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
Random Forest

Accuracy for test size 0.05 is 0.9375
Accuracy for test size 0.1 is 0.8387096774193549
Accuracy for test size 0.15 is 0.782608695652174
Accuracy for test size 0.2 is 0.8688524590163934
Accuracy for test size 0.25 is 0.8157894736842105
Accuracy for test size 0.3 is 0.8681318681318682
Accuracy for test size 0.35 is 0.8130841121495327
Accuracy for test size 0.4 is 0.8278688524590164
Accuracy for test size 0.45 is 0.8394160583941606
Accuracy for test size 0.5 is 0.8223684210526315
Accuracy for test size 0.55 is 0.8083832335329342
Accuracy for test size 0.6 is 0.8186813186813187

------------------------------------------------

KNeighbor with different test sizes

In [984]:
print("KNeighbors\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    knn = KNeighborsClassifier(n_neighbors = 4)
    knn.fit(x_train, y_train)
    knn.predict(x_test)
    accuracy = knn.score(x_test,y_test)
    
    knn_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
KNeighbors

Accuracy for test size 0.05 is 0.8125
Accuracy for test size 0.1 is 0.7741935483870968
Accuracy for test size 0.15 is 0.8478260869565217
Accuracy for test size 0.2 is 0.8524590163934426
Accuracy for test size 0.25 is 0.8552631578947368
Accuracy for test size 0.3 is 0.8351648351648352
Accuracy for test size 0.35 is 0.8411214953271028
Accuracy for test size 0.4 is 0.8524590163934426
Accuracy for test size 0.45 is 0.8467153284671532
Accuracy for test size 0.5 is 0.8421052631578947
Accuracy for test size 0.55 is 0.8263473053892215
Accuracy for test size 0.6 is 0.8296703296703297

------------------------------------------------

Naive Bayes with different test sizes

In [985]:
print("Naive Bayes\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    nb = GaussianNB()
    nb.fit(x_train, y_train)
    nb.predict(x_test)
    accuracy = nb.score(x_test, y_test)
    
    nb_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
Naive Bayes

Accuracy for test size 0.05 is 1.0
Accuracy for test size 0.1 is 0.9354838709677419
Accuracy for test size 0.15 is 0.8913043478260869
Accuracy for test size 0.2 is 0.9180327868852459
Accuracy for test size 0.25 is 0.9078947368421053
Accuracy for test size 0.3 is 0.8571428571428571
Accuracy for test size 0.35 is 0.8504672897196262
Accuracy for test size 0.4 is 0.8278688524590164
Accuracy for test size 0.45 is 0.8394160583941606
Accuracy for test size 0.5 is 0.7697368421052632
Accuracy for test size 0.55 is 0.8083832335329342
Accuracy for test size 0.6 is 0.8131868131868132

------------------------------------------------

SVM with different test sizes

In [986]:
print("SVM\n")

j = 0

for i in test_size:
    x_train, x_test , y_train , y_test = train_test_split(x , y , test_size = i , random_state = 0)
    svm = SVC(random_state = 1)
    svm.fit(x_train, y_train)
    svm.predict(x_test)
    accuracy = svm.score(x_test,y_test)
    
    svm_acc[j] = accuracy
    j = j + 1
   
    print("Accuracy for test size", i, "is", accuracy)
    
print("\n------------------------------------------------")
SVM

Accuracy for test size 0.05 is 0.9375
Accuracy for test size 0.1 is 0.8064516129032258
Accuracy for test size 0.15 is 0.8478260869565217
Accuracy for test size 0.2 is 0.8688524590163934
Accuracy for test size 0.25 is 0.8552631578947368
Accuracy for test size 0.3 is 0.8461538461538461
Accuracy for test size 0.35 is 0.8317757009345794
Accuracy for test size 0.4 is 0.8770491803278688
Accuracy for test size 0.45 is 0.8540145985401459
Accuracy for test size 0.5 is 0.8355263157894737
Accuracy for test size 0.55 is 0.8323353293413174
Accuracy for test size 0.6 is 0.8406593406593407

------------------------------------------------

Visualize results for different test sizes and algorithms

In [987]:
plt.figure(figsize=(30, 15))

plt.plot(test_size, lr_acc, label='Logistic Regression')
plt.plot(test_size, rf_acc, label='Random Forest')
plt.plot(test_size, dt_acc, label='Decision Tree')
plt.plot(test_size, knn_acc, label='KNN')
plt.plot(test_size, nb_acc, label='Naive Bayes')
plt.plot(test_size, svm_acc, label='SVM')


plt.xlabel('Test size')
plt.ylabel('Accurancy')
plt.title('')
plt.legend()
plt.show()

Let's test different algorithms on different batch sizes. We will compare accuracy and balanced F-score metrics.

f1_score function compute the F1 score, also known as balanced F-score or F-measure

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is: F1 = 2 (precision recall) / (precision + recall)

In [988]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

acc_list_lr_acc = []
acc_list_lr_f1 = []

acc_list_dt_acc = []
acc_list_dt_f1 = []

acc_list_rf_acc = []
acc_list_rf_f1 = []

acc_list_knn_acc = []
acc_list_knn_f1 = []

acc_list_nb_acc = []
acc_list_nb_f1 = []

acc_list_svm_acc = []
acc_list_svm_f1 = []

train_batch = [5, 10, 15, 25, 50, 75, 100, 125, 150, 175, 200, 250]

for train_size in train_batch:
    
    train_data_x = x[:train_size]
    test_data_x = x[train_size:train_size + 50]
    
    train_data_y = y[:train_size]
    test_data_y = y[train_size:train_size + 50]

    X_train = train_data_x
    Y_train = train_data_y
    X_test = test_data_x
    Y_test = test_data_y 

    # Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, Y_train)
    
    acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
    acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))

    # Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    dt = DecisionTreeClassifier()
    dt.fit(X_train, Y_train)
    
    acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
    acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))

    # Random Forest
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
    rf.fit(X_train, Y_train)
    
    acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
    acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))

    # K-Nearest Neighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 4)
    knn.fit(X_train, Y_train)
    
    acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
    acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))

    # Gaussian Naive Bayes
    from sklearn.naive_bayes import GaussianNB
    nb = GaussianNB()
    nb.fit(X_train, Y_train)
    
    acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
    acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))

    # C-Support Vector Classification
    from sklearn.svm import SVC
    svm = SVC(random_state = 1)
    svm.fit(X_train, Y_train)
    
    acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
    acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))   
    
    print("Train with train size", train_size)
    
    print("Logistic Regression F-measure binary {:3.4f}".format(f1_score(lr.predict(X_test), Y_test, average='binary')))
    print("Logistic Regression Calculated accuracy: {:3.4f}%\n".format(accuracy_score(lr.predict(X_test),Y_test)*100))
    
    print("Decision Tree F-measure binary {:3.4f}".format(f1_score(dt.predict(X_test), Y_test, average='binary')))
    print("Decision Tree Calculated accuracy: {:3.4f}%\n".format(accuracy_score(dt.predict(X_test),Y_test)*100))
    
    print("Random Forest F-measure binary {:3.4f}".format(f1_score(rf.predict(X_test), Y_test, average='binary')))
    print("Random Forest Calculated accuracy: {:3.4f}%\n".format(accuracy_score(rf.predict(X_test),Y_test)*100))
    
    print("K-Nearest Neighbors F-measure binary {:3.4f}".format(f1_score(knn.predict(X_test), Y_test, average='binary')))
    print("K-Nearest Neighbors Calculated accuracy: {:3.4f}%\n".format(accuracy_score(knn.predict(X_test),Y_test)*100))
    
    print("Gaussian Naive Bayes F-measure binary {:3.4f}".format(f1_score(nb.predict(X_test), Y_test, average='binary')))
    print("Gaussian Naive Bayes Calculated accuracy: {:3.4f}%\n".format(accuracy_score(nb.predict(X_test),Y_test)*100))
    
    print("C-Support Vector Classification F-measure binary {:3.4f}".format(f1_score(svm.predict(X_test), Y_test, average='binary')))
    print("C-Support Vector Classification Calculated accuracy: {:3.4f}%\n".format(accuracy_score(svm.predict(X_test),Y_test)*100))
    
    print("-----------------------------------------------------------------")
Train with train size 5
Logistic Regression F-measure binary 0.6800
Logistic Regression Calculated accuracy: 68.0000%

Decision Tree F-measure binary 0.7857
Decision Tree Calculated accuracy: 76.0000%

Random Forest F-measure binary 0.5500
Random Forest Calculated accuracy: 64.0000%

K-Nearest Neighbors F-measure binary 0.0000
K-Nearest Neighbors Calculated accuracy: 44.0000%

Gaussian Naive Bayes F-measure binary 0.2353
Gaussian Naive Bayes Calculated accuracy: 48.0000%

C-Support Vector Classification F-measure binary 0.3333
C-Support Vector Classification Calculated accuracy: 52.0000%

-----------------------------------------------------------------
Train with train size 10
Logistic Regression F-measure binary 0.7170
Logistic Regression Calculated accuracy: 70.0000%

Decision Tree F-measure binary 0.6400
Decision Tree Calculated accuracy: 64.0000%

Random Forest F-measure binary 0.5957
Random Forest Calculated accuracy: 62.0000%

K-Nearest Neighbors F-measure binary 0.6122
K-Nearest Neighbors Calculated accuracy: 62.0000%

Gaussian Naive Bayes F-measure binary 0.3243
Gaussian Naive Bayes Calculated accuracy: 50.0000%

C-Support Vector Classification F-measure binary 0.5333
C-Support Vector Classification Calculated accuracy: 58.0000%

-----------------------------------------------------------------
Train with train size 15
Logistic Regression F-measure binary 0.6296
Logistic Regression Calculated accuracy: 60.0000%

Decision Tree F-measure binary 0.6441
Decision Tree Calculated accuracy: 58.0000%

Random Forest F-measure binary 0.7667
Random Forest Calculated accuracy: 72.0000%

K-Nearest Neighbors F-measure binary 0.7273
K-Nearest Neighbors Calculated accuracy: 70.0000%

Gaussian Naive Bayes F-measure binary 0.7302
Gaussian Naive Bayes Calculated accuracy: 66.0000%

C-Support Vector Classification F-measure binary 0.6538
C-Support Vector Classification Calculated accuracy: 64.0000%

-----------------------------------------------------------------
Train with train size 25
Logistic Regression F-measure binary 0.7586
Logistic Regression Calculated accuracy: 72.0000%

Decision Tree F-measure binary 0.6667
Decision Tree Calculated accuracy: 60.0000%

Random Forest F-measure binary 0.7869
Random Forest Calculated accuracy: 74.0000%

K-Nearest Neighbors F-measure binary 0.7407
K-Nearest Neighbors Calculated accuracy: 72.0000%

Gaussian Naive Bayes F-measure binary 0.8475
Gaussian Naive Bayes Calculated accuracy: 82.0000%

C-Support Vector Classification F-measure binary 0.7719
C-Support Vector Classification Calculated accuracy: 74.0000%

-----------------------------------------------------------------
Train with train size 50
Logistic Regression F-measure binary 0.8136
Logistic Regression Calculated accuracy: 78.0000%

Decision Tree F-measure binary 0.8519
Decision Tree Calculated accuracy: 84.0000%

Random Forest F-measure binary 0.8065
Random Forest Calculated accuracy: 76.0000%

K-Nearest Neighbors F-measure binary 0.7308
K-Nearest Neighbors Calculated accuracy: 72.0000%

Gaussian Naive Bayes F-measure binary 0.7869
Gaussian Naive Bayes Calculated accuracy: 74.0000%

C-Support Vector Classification F-measure binary 0.8000
C-Support Vector Classification Calculated accuracy: 76.0000%

-----------------------------------------------------------------
Train with train size 75
Logistic Regression F-measure binary 0.8485
Logistic Regression Calculated accuracy: 80.0000%

Decision Tree F-measure binary 0.7333
Decision Tree Calculated accuracy: 68.0000%

Random Forest F-measure binary 0.8125
Random Forest Calculated accuracy: 76.0000%

K-Nearest Neighbors F-measure binary 0.8070
K-Nearest Neighbors Calculated accuracy: 78.0000%

Gaussian Naive Bayes F-measure binary 0.8788
Gaussian Naive Bayes Calculated accuracy: 84.0000%

C-Support Vector Classification F-measure binary 0.8235
C-Support Vector Classification Calculated accuracy: 76.0000%

-----------------------------------------------------------------
Train with train size 100
Logistic Regression F-measure binary 0.8679
Logistic Regression Calculated accuracy: 86.0000%

Decision Tree F-measure binary 0.8400
Decision Tree Calculated accuracy: 84.0000%

Random Forest F-measure binary 0.8571
Random Forest Calculated accuracy: 84.0000%

K-Nearest Neighbors F-measure binary 0.8462
K-Nearest Neighbors Calculated accuracy: 84.0000%

Gaussian Naive Bayes F-measure binary 0.8421
Gaussian Naive Bayes Calculated accuracy: 82.0000%

C-Support Vector Classification F-measure binary 0.8421
C-Support Vector Classification Calculated accuracy: 82.0000%

-----------------------------------------------------------------
Train with train size 125
Logistic Regression F-measure binary 0.8085
Logistic Regression Calculated accuracy: 82.0000%

Decision Tree F-measure binary 0.7234
Decision Tree Calculated accuracy: 74.0000%

Random Forest F-measure binary 0.8163
Random Forest Calculated accuracy: 82.0000%

K-Nearest Neighbors F-measure binary 0.7755
K-Nearest Neighbors Calculated accuracy: 78.0000%

Gaussian Naive Bayes F-measure binary 0.7755
Gaussian Naive Bayes Calculated accuracy: 78.0000%

C-Support Vector Classification F-measure binary 0.8000
C-Support Vector Classification Calculated accuracy: 80.0000%

-----------------------------------------------------------------
Train with train size 150
Logistic Regression F-measure binary 0.7843
Logistic Regression Calculated accuracy: 78.0000%

Decision Tree F-measure binary 0.8077
Decision Tree Calculated accuracy: 80.0000%

Random Forest F-measure binary 0.7925
Random Forest Calculated accuracy: 78.0000%

K-Nearest Neighbors F-measure binary 0.7083
K-Nearest Neighbors Calculated accuracy: 72.0000%

Gaussian Naive Bayes F-measure binary 0.8077
Gaussian Naive Bayes Calculated accuracy: 80.0000%

C-Support Vector Classification F-measure binary 0.8148
C-Support Vector Classification Calculated accuracy: 80.0000%

-----------------------------------------------------------------
Train with train size 175
Logistic Regression F-measure binary 0.8621
Logistic Regression Calculated accuracy: 84.0000%

Decision Tree F-measure binary 0.7308
Decision Tree Calculated accuracy: 72.0000%

Random Forest F-measure binary 0.8475
Random Forest Calculated accuracy: 82.0000%

K-Nearest Neighbors F-measure binary 0.7308
K-Nearest Neighbors Calculated accuracy: 72.0000%

Gaussian Naive Bayes F-measure binary 0.8727
Gaussian Naive Bayes Calculated accuracy: 86.0000%

C-Support Vector Classification F-measure binary 0.8525
C-Support Vector Classification Calculated accuracy: 82.0000%

-----------------------------------------------------------------
Train with train size 200
Logistic Regression F-measure binary 0.9000
Logistic Regression Calculated accuracy: 88.0000%

Decision Tree F-measure binary 0.7719
Decision Tree Calculated accuracy: 74.0000%

Random Forest F-measure binary 0.8136
Random Forest Calculated accuracy: 78.0000%

K-Nearest Neighbors F-measure binary 0.8276
K-Nearest Neighbors Calculated accuracy: 80.0000%

Gaussian Naive Bayes F-measure binary 0.9310
Gaussian Naive Bayes Calculated accuracy: 92.0000%

C-Support Vector Classification F-measure binary 0.9000
C-Support Vector Classification Calculated accuracy: 88.0000%

-----------------------------------------------------------------
Train with train size 250
Logistic Regression F-measure binary 0.8077
Logistic Regression Calculated accuracy: 80.0000%

Decision Tree F-measure binary 0.7719
Decision Tree Calculated accuracy: 74.0000%

Random Forest F-measure binary 0.8077
Random Forest Calculated accuracy: 80.0000%

K-Nearest Neighbors F-measure binary 0.7451
K-Nearest Neighbors Calculated accuracy: 74.0000%

Gaussian Naive Bayes F-measure binary 0.7547
Gaussian Naive Bayes Calculated accuracy: 74.0000%

C-Support Vector Classification F-measure binary 0.8421
C-Support Vector Classification Calculated accuracy: 82.0000%

-----------------------------------------------------------------
In [989]:
plt.figure(figsize=(35, 20))

plt.title("Dependency of accuracy on batch size")

plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_acc, label='KNN')
plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_acc, label='SVM')


plt.xlabel('Batch size')
plt.ylabel('Accurancy')
plt.title('')
plt.legend()
plt.show()
In [ ]:
 
In [990]:
plt.figure(figsize=(35, 20))

plt.title("Dependency of F-measure on batch size")

plt.plot(train_batch, acc_list_lr_f1, label='Logistic Regression')
plt.plot(train_batch, acc_list_rf_f1, label='Random Forest')
plt.plot(train_batch, acc_list_dt_f1, label='Decision Tree')
plt.plot(train_batch, acc_list_knn_f1, label='KNN')
plt.plot(train_batch, acc_list_nb_f1, label='Naive Bayes')
plt.plot(train_batch, acc_list_svm_f1, label='SVM')


plt.xlabel('Batch size')
plt.ylabel('Accurancy')
plt.title('')
plt.legend()
plt.show()

Result

As we can see from figure, the best algorithm is Logistic Regression. It shows an increase in almost the entire segment. We can say that the batch size of 125 is enough to get accuracy about 90%. So let's research our model deeper.

In [991]:
res_acc_list_lr_acc = []
res_acc_list_lr_f1 = []

res_train_size = 125

train_data_xx = x[:res_train_size]
train_data_yy = y[:res_train_size]


XX_train = train_data_xx
YY_train = train_data_yy

res_test_batch = [10, 15, 25, 30, 75, 100, 125, 150, 175, 200, 225, 250, 300]

for res_test_size in res_test_batch:

    test_data_xx = x[:res_test_size]
    test_data_yy = y[:res_test_size]

    XX_test = test_data_xx
    YY_test = test_data_yy

    # Logistic Regression
    res_lr = LogisticRegression()
    res_lr.fit(XX_train, YY_train)

    res_acc_list_lr_acc.append(accuracy_score(res_lr.predict(XX_test),YY_test))
    res_acc_list_lr_f1.append(f1_score(res_lr.predict(XX_test), YY_test, average='binary'))
    
    print("Ttest with test size:", res_test_size)
    print("Logistic Regression calculated F-measure binary {:3.4f}".format(f1_score(res_lr.predict(XX_test), YY_test, average='binary')))
    print("Logistic Regression calculated accuracy: {:3.4f}%\n".format(accuracy_score(res_lr.predict(XX_test),YY_test)*100))
    
    print("-----------------------------------------------------------------")
Ttest with test size: 10
Logistic Regression calculated F-measure binary 0.6667
Logistic Regression calculated accuracy: 70.0000%

-----------------------------------------------------------------
Ttest with test size: 15
Logistic Regression calculated F-measure binary 0.7500
Logistic Regression calculated accuracy: 73.3333%

-----------------------------------------------------------------
Ttest with test size: 25
Logistic Regression calculated F-measure binary 0.7692
Logistic Regression calculated accuracy: 76.0000%

-----------------------------------------------------------------
Ttest with test size: 30
Logistic Regression calculated F-measure binary 0.8000
Logistic Regression calculated accuracy: 80.0000%

-----------------------------------------------------------------
Ttest with test size: 75
Logistic Regression calculated F-measure binary 0.8636
Logistic Regression calculated accuracy: 84.0000%

-----------------------------------------------------------------
Ttest with test size: 100
Logistic Regression calculated F-measure binary 0.8718
Logistic Regression calculated accuracy: 85.0000%

-----------------------------------------------------------------
Ttest with test size: 125
Logistic Regression calculated F-measure binary 0.8874
Logistic Regression calculated accuracy: 86.4000%

-----------------------------------------------------------------
Ttest with test size: 150
Logistic Regression calculated F-measure binary 0.8851
Logistic Regression calculated accuracy: 86.6667%

-----------------------------------------------------------------
Ttest with test size: 175
Logistic Regression calculated F-measure binary 0.8687
Logistic Regression calculated accuracy: 85.1429%

-----------------------------------------------------------------
Ttest with test size: 200
Logistic Regression calculated F-measure binary 0.8673
Logistic Regression calculated accuracy: 85.0000%

-----------------------------------------------------------------
Ttest with test size: 225
Logistic Regression calculated F-measure binary 0.8750
Logistic Regression calculated accuracy: 85.7778%

-----------------------------------------------------------------
Ttest with test size: 250
Logistic Regression calculated F-measure binary 0.8741
Logistic Regression calculated accuracy: 85.6000%

-----------------------------------------------------------------
Ttest with test size: 300
Logistic Regression calculated F-measure binary 0.8688
Logistic Regression calculated accuracy: 85.0000%

-----------------------------------------------------------------
In [992]:
plt.figure(figsize=(35, 20))

plt.title("test")

plt.plot(res_test_batch, res_acc_list_lr_acc, label='Logistic Regression Accuracy')
plt.plot(res_test_batch, res_acc_list_lr_f1, label='Logistic Regression F-measure')


plt.xlabel('test size')
plt.ylabel('accuracy')
plt.title('')
plt.legend()
plt.show()
In [993]:
print("Logistic Regression linear equation:")

res_y_string = "y = "
for i in range(13):
    res_y_string += str(round(res_lr.coef_[0][i], 4))+"*x"+str(i)+" + "

res_y_string += str(round(res_lr.intercept_[0], 4));
print(y_string)
Logistic Regression linear equation:
y = -0.1196*x0 + -0.8133*x1 + 0.7947*x2 + -0.2819*x3 + -0.1597*x4 + 0.1524*x5 + 0.2273*x6 + 0.6545*x7 + -0.4659*x8 + -0.5787*x9 + 0.4118*x10 + -0.8486*x11 + -0.7369*x12 + 0.0921

Let's test our model on such patient as:

  • Age (age in years): 58
  • Sex: male
  • CP (chest pain type): 0
  • TRESTBPS (resting blood pressure (in mm Hg on admission to the hospital)): 100
  • CHOL (serum cholestoral in mg/dl): 234
  • FPS (fasting blood sugar > 120 mg/dl): true
  • RESTECH (resting electrocardiographic results): 1
  • THALACH (maximum heart rate achieved): 156
  • EXANG (exercise induced angina) no // стенокардия, вызванная физической нагрузкой
  • OLDPEAK (ST depression induced by exercise relative to rest): 0.1
  • SLOPE (the slope of the peak exercise ST segment): 2
  • CA (number of major vessels (0-3) colored by flourosopy): 1
  • THAL (3 = normal; 6 = fixed defect; 7 = reversable defect): normal
  • In [1017]:
    acc_list_lr_acc = []
    acc_list_lr_f1 = []
    
    acc_list_dt_acc = []
    acc_list_dt_f1 = []
    
    acc_list_rf_acc = []
    acc_list_rf_f1 = []
    
    acc_list_knn_acc = []
    acc_list_knn_f1 = []
    
    acc_list_nb_acc = []
    acc_list_nb_f1 = []
    
    acc_list_svm_acc = []
    acc_list_svm_f1 = []
    
    train_batch = np.arange(15, 250, 15)
    
    test_pred_size = 15
    
    print(train_batch)
    
    for train_size in train_batch:
        
        train_data_x = x[:train_size]
        test_data_x = x[train_size:train_size + test_pred_size]
        
        train_data_y = y[:train_size]
        test_data_y = y[train_size:train_size + test_pred_size]
        
        print(x)
    
        X_train = train_data_x
        Y_train = train_data_y
        X_test = test_data_x
        Y_test = test_data_y 
    
        # Logistic Regression
        lr = LogisticRegression()
        lr.fit(X_train, Y_train)
        
        acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
        acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
    
        # Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        dt = DecisionTreeClassifier()
        dt.fit(X_train, Y_train)
        
        acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
        acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
    
        # Random Forest
        from sklearn.ensemble import RandomForestClassifier
        rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
        rf.fit(X_train, Y_train)
        
        acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
        acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
    
        # K-Nearest Neighbors
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors = 4)
        knn.fit(X_train, Y_train)
        
        acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
        acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
    
        # Gaussian Naive Bayes
        from sklearn.naive_bayes import GaussianNB
        nb = GaussianNB()
        nb.fit(X_train, Y_train)
        
        acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
        acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
    
        # C-Support Vector Classification
        from sklearn.svm import SVC
        svm = SVC(random_state = 1)
        svm.fit(X_train, Y_train)
        
        acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
        acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
    
    [ 15  30  45  60  75  90 105 120 135 150 165 180 195 210 225 240]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    
    In [1018]:
    plt.figure(figsize=(35, 20))
    
    plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
    plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
    plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
    plt.plot(train_batch, acc_list_knn_acc, label='KNN')
    plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
    plt.plot(train_batch, acc_list_svm_acc, label='SVM')
    
    plt.xlabel('Batch size')
    plt.ylabel('Accurancy')
    plt.title('Dependency of accuracy on batch size, test size: 10, delta: 15')
    plt.legend()
    plt.show()
    

    Test size: 10, delta: 15.

    The best model is Logistic Regression. An appropriate batch size is about 105.

    The worst model is aso Decision Tree

    About 105 is enough to get 0.8 accurancy almost for all models

    In [1019]:
    acc_list_lr_acc = []
    acc_list_lr_f1 = []
    
    acc_list_dt_acc = []
    acc_list_dt_f1 = []
    
    acc_list_rf_acc = []
    acc_list_rf_f1 = []
    
    acc_list_knn_acc = []
    acc_list_knn_f1 = []
    
    acc_list_nb_acc = []
    acc_list_nb_f1 = []
    
    acc_list_svm_acc = []
    acc_list_svm_f1 = []
    
    train_batch = np.arange(15, 250, 15)
    
    test_pred_size = 30
    
    print(train_batch)
    
    for train_size in train_batch:
        
        train_data_x = x[:train_size]
        test_data_x = x[train_size:train_size + test_pred_size]
        
        train_data_y = y[:train_size]
        test_data_y = y[train_size:train_size + test_pred_size]
        
        print(x)
    
        X_train = train_data_x
        Y_train = train_data_y
        X_test = test_data_x
        Y_test = test_data_y 
    
        # Logistic Regression
        lr = LogisticRegression()
        lr.fit(X_train, Y_train)
        
        acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
        acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
    
        # Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        dt = DecisionTreeClassifier()
        dt.fit(X_train, Y_train)
        
        acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
        acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
    
        # Random Forest
        from sklearn.ensemble import RandomForestClassifier
        rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
        rf.fit(X_train, Y_train)
        
        acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
        acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
    
        # K-Nearest Neighbors
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors = 4)
        knn.fit(X_train, Y_train)
        
        acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
        acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
    
        # Gaussian Naive Bayes
        from sklearn.naive_bayes import GaussianNB
        nb = GaussianNB()
        nb.fit(X_train, Y_train)
        
        acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
        acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
    
        # C-Support Vector Classification
        from sklearn.svm import SVC
        svm = SVC(random_state = 1)
        svm.fit(X_train, Y_train)
        
        acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
        acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
    
    [ 15  30  45  60  75  90 105 120 135 150 165 180 195 210 225 240]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    
    In [1020]:
    plt.figure(figsize=(35, 20))
    
    plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
    plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
    plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
    plt.plot(train_batch, acc_list_knn_acc, label='KNN')
    plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
    plt.plot(train_batch, acc_list_svm_acc, label='SVM')
    
    
    plt.xlabel('Batch size')
    plt.ylabel('Accurancy')
    plt.title('Dependency of accuracy on batch size, test size: 30, delta: 15')
    plt.legend()
    plt.show()
    

    Test size: 30, delta: 15.

    The best model is also Logistic Regression. An appropriate batch size is about 90 to get 87% accurancy.

    The worst model is Decision Tree

    About 90 is enough to get 0.8 accurancy almost for all models

    In [1021]:
    acc_list_lr_acc = []
    acc_list_lr_f1 = []
    
    acc_list_dt_acc = []
    acc_list_dt_f1 = []
    
    acc_list_rf_acc = []
    acc_list_rf_f1 = []
    
    acc_list_knn_acc = []
    acc_list_knn_f1 = []
    
    acc_list_nb_acc = []
    acc_list_nb_f1 = []
    
    acc_list_svm_acc = []
    acc_list_svm_f1 = []
    
    train_batch = np.arange(15, 240, 15)
    
    test_pred_size = 60
    
    print(train_batch)
    
    for train_size in train_batch:
        
        train_data_x = x[:train_size]
        test_data_x = x[train_size:train_size + test_pred_size]
        
        train_data_y = y[:train_size]
        test_data_y = y[train_size:train_size + test_pred_size]
        
        print(x)
    
        X_train = train_data_x
        Y_train = train_data_y
        X_test = test_data_x
        Y_test = test_data_y 
    
        # Logistic Regression
        lr = LogisticRegression()
        lr.fit(X_train, Y_train)
        
        acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
        acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
    
        # Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        dt = DecisionTreeClassifier()
        dt.fit(X_train, Y_train)
        
        acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
        acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
    
        # Random Forest
        from sklearn.ensemble import RandomForestClassifier
        rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
        rf.fit(X_train, Y_train)
        
        acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
        acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
    
        # K-Nearest Neighbors
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors = 4)
        knn.fit(X_train, Y_train)
        
        acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
        acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
    
        # Gaussian Naive Bayes
        from sklearn.naive_bayes import GaussianNB
        nb = GaussianNB()
        nb.fit(X_train, Y_train)
        
        acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
        acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
    
        # C-Support Vector Classification
        from sklearn.svm import SVC
        svm = SVC(random_state = 1)
        svm.fit(X_train, Y_train)
        
        acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
        acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
    
    [ 15  30  45  60  75  90 105 120 135 150 165 180 195 210 225]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    
    In [1022]:
    plt.figure(figsize=(35, 20))
    
    plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
    plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
    plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
    plt.plot(train_batch, acc_list_knn_acc, label='KNN')
    plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
    plt.plot(train_batch, acc_list_svm_acc, label='SVM')
    
    
    plt.xlabel('Batch size')
    plt.ylabel('Accurancy')
    plt.title('Dependency of accuracy on batch size, test size: 60, delta: 15')
    plt.legend()
    plt.show()
    

    Test size: 60, delta: 15.

    The best model is also Logistic Regression. An appropriate batch size is about 90 to get 87% accurancy.

    The worst model is Decision Tree

    About 90 is enough to get 0.8 accurancy almost for all models

    In [1023]:
    acc_list_lr_acc = []
    acc_list_lr_f1 = []
    
    acc_list_dt_acc = []
    acc_list_dt_f1 = []
    
    acc_list_rf_acc = []
    acc_list_rf_f1 = []
    
    acc_list_knn_acc = []
    acc_list_knn_f1 = []
    
    acc_list_nb_acc = []
    acc_list_nb_f1 = []
    
    acc_list_svm_acc = []
    acc_list_svm_f1 = []
    
    train_batch = np.arange(15, 200, 15)
    
    test_pred_size = 100
    
    print(train_batch)
    
    for train_size in train_batch:
        
        train_data_x = x[:train_size]
        test_data_x = x[train_size:train_size + test_pred_size]
        
        train_data_y = y[:train_size]
        test_data_y = y[train_size:train_size + test_pred_size]
        
        print(x)
    
        X_train = train_data_x
        Y_train = train_data_y
        X_test = test_data_x
        Y_test = test_data_y 
    
        # Logistic Regression
        lr = LogisticRegression()
        lr.fit(X_train, Y_train)
        
        acc_list_lr_acc.append(accuracy_score(lr.predict(X_test),Y_test))
        acc_list_lr_f1.append(f1_score(lr.predict(X_test), Y_test, average='binary'))
    
        # Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        dt = DecisionTreeClassifier()
        dt.fit(X_train, Y_train)
        
        acc_list_dt_acc.append(accuracy_score(dt.predict(X_test),Y_test))
        acc_list_dt_f1.append(f1_score(dt.predict(X_test), Y_test, average='binary'))
    
        # Random Forest
        from sklearn.ensemble import RandomForestClassifier
        rf = RandomForestClassifier(n_estimators = 27, random_state = 42)
        rf.fit(X_train, Y_train)
        
        acc_list_rf_acc.append(accuracy_score(rf.predict(X_test),Y_test))
        acc_list_rf_f1.append(f1_score(rf.predict(X_test), Y_test, average='binary'))
    
        # K-Nearest Neighbors
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors = 4)
        knn.fit(X_train, Y_train)
        
        acc_list_knn_acc.append(accuracy_score(knn.predict(X_test),Y_test))
        acc_list_knn_f1.append(f1_score(knn.predict(X_test), Y_test, average='binary'))
    
        # Gaussian Naive Bayes
        from sklearn.naive_bayes import GaussianNB
        nb = GaussianNB()
        nb.fit(X_train, Y_train)
        
        acc_list_nb_acc.append(accuracy_score(nb.predict(X_test),Y_test))
        acc_list_nb_f1.append(f1_score(nb.predict(X_test), Y_test, average='binary'))
    
        # C-Support Vector Classification
        from sklearn.svm import SVC
        svm = SVC(random_state = 1)
        svm.fit(X_train, Y_train)
        
        acc_list_svm_acc.append(accuracy_score(svm.predict(X_test),Y_test))
        acc_list_svm_f1.append(f1_score(svm.predict(X_test), Y_test, average='binary'))
    
    [ 15  30  45  60  75  90 105 120 135 150 165 180 195]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    [[ 0.62133012 -1.46841752 -0.93851463 ... -0.64911323  1.24459328
       1.12302895]
     [ 0.9521966  -1.46841752 -0.93851463 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.47415758  0.68100522  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     ...
     [-0.04040284 -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.1432911  -1.46841752  1.00257707 ... -0.64911323 -0.71442887
      -0.51292188]
     [-1.36386876  0.68100522 -0.93851463 ...  0.97635214 -0.71442887
      -0.51292188]]
    
    In [1024]:
    plt.figure(figsize=(35, 20))
    
    plt.plot(train_batch, acc_list_lr_acc, label='Logistic Regression')
    plt.plot(train_batch, acc_list_rf_acc, label='Random Forest')
    plt.plot(train_batch, acc_list_dt_acc, label='Decision Tree')
    plt.plot(train_batch, acc_list_knn_acc, label='KNN')
    plt.plot(train_batch, acc_list_nb_acc, label='Naive Bayes')
    plt.plot(train_batch, acc_list_svm_acc, label='SVM')
    
    
    plt.xlabel('Batch size')
    plt.ylabel('Accurancy')
    plt.title('Dependency of accuracy on batch size, test size: 100, delta: 15')
    plt.legend()
    plt.show()
    

    Test size: 100, delta: 15.

    The best model is also Logistic Regression. An appropriate batch size is about 90 to get 85% accurancy. It also for Naive Bayes. It's not so bad and got good results after batch > 150%.

    The worst model is also Decision Tree

    But here 0.8 accurancy the size is 150 almost for all models

    CONCLUSION

    As result we researched the medical Heart Disease UCI Dataset.

    What did we do and get?

    • Dataset columns feature explain;
    • Investigating the data (visualization, distribution etc.)
    • Correlation (pearsman, spearman, kandall)
    • Implement, research and compare 5 algorithms of classification:
      • Logistic Regression
      • Decision Trees
      • Random Forest
      • K-Nearest Neighbors
      • Naive Bayes
      • C-Support Vector Classification

      All results are explained in previous chapter.
    In [ ]: